Chunyuan Li

My research centers on multimodal intelligence, with a focus on large-scale language and vision training. Key contributions include LLaVA and its model family, as well as foundational early work such as GroundingDINO, GLIP, GLIGEN, Florence, and Oscar.

My experience includes research roles at xAI, ByteDance, and Microsoft Research, Redmond. I earned my PhD in machine learning from Duke University under the guidance of Prof. Lawrence Carin, where my doctoral research explored deep generative models. I have also served the community as an Area Chair for NeurIPS, ICML, ICLR, EMNLP, and TMLR, and as a Guest Editor of IJCV on "the promises and dangers of large vision models".


news

2025 Grok-3: Visual understanding and real-time video in voice mode.
2024 Exploring the boundaries of fully open-source VLMs to establish a mature recipe, documented in a blog series and on GitHub:
  • LLaVA-NeXT, LLaVA-OneVision, LLaVA-Video, LLaVA-Critic
Developing a proprietary, industry-leading VLM for image and video understanding: Seed-VL-1.5
Oct/Nov, 2023 LLaVA is upgraded:
  • LLaVA-1.5 achieves SoTA on 11 benchmarks among open-source VLMs. It uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses prior SoTA methods that use billion-scale data. [Project] [Paper] [Github] [Demo] [Model Zoo]
  • LLaVA-Interactive: Experience the future of human-AI multimodal interaction with an all-in-one demo for image chat, segmentation, generation and editing. [Project] [Paper] [Github] [Demo]
  • LLaVA-Plus expands the capabilities of LLaVA by learning to use external tools for creating multimodal agents. [Project] [Paper] [Github] [Demo]
September 20, 2023 A 110-page paper is released to share our perspective on multimodal models: "Multimodal Foundation Models: From Specialists to General-Purpose Assistants". This is based on our CVPR 2023 Tutorial. [Note on Large Multimodal Models] [Slides] [YouTube] [Bilibili]
June 1, 2023 LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. NeurIPS 2023 Datasets and Benchmarks Track (Spotlight)
April 17, 2023 Visual Instruction Tuning with GPT-4! We release LLaVA, a Large Language-and-Vision Assistant built towards multimodal GPT-4-level capabilities. NeurIPS 2023 (Oral Presentation) [Project] [Paper] [Github] [Demo] [Data] [Model] [Scaling Note]
April 7, 2023 Instruction Tuning with GPT-4! A "first attempt" to use GPT-4 data for LLM self-instruct tuning. [Paper] [Github] [My Learnings]
March, 2023 CVPR 2023:
  • REACT improves foundation models on various vision tasks by customizing them with retrieval-augmented multimodal knowledge. [Code] (Highlight, top 2.5%)
  • GLIGEN enables a new capability for frozen text-to-image generation models: open-set grounding. [Demo] [Code] [YouTube]
  • X-Decoder: a generalist decoder for pixel, image and language [Demo] [Code]
Feb, 2023 The 2nd Workshop and Challenge on Computer Vision in the Wild (CVinW) at CVPR 2023. For those who are new to this topic, please check out the CVinW Reading List. [Workshop] [SGinW Challenge] [RF100 Challenge]
Oct 23, 2022 The 1st Workshop and Challenge on Computer Vision in the Wild (CVinW) at ECCV 2022. Please check out the videos of this event at [YouTube] [BiliBili]. [Workshop] [ICinW Challenge] [ODinW Challenge]
Oct 17, 2022 "Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends", a 100-page survey paper published in Foundations and Trends® in Computer Graphics and Vision.
Sep 16, 2022 NeurIPS 2022: K-LITE (Oral, top 1%), ELEVATER, and FocalNet. A team effort to push CVinW. [CVPR Tutorial]
  • K-LITE demonstrates the effectiveness of external knowledge in improving language-image models for zero-/few-shot task transfer
  • ELEVATER is a platform with 20 image classification and 35 object detection public datasets for evaluating language-image models in task-level visual transfer. [Benchmark Website]
  • FocalNet [paper, code, demo, blog] achieves SoTA on COCO object detection with a simple, attention-free architecture
Mar 25, 2022 Upcoming events as a co-organizer:
Mar 1, 2022 CVPR 2022:
June 17, 2021 EsViT achieves SoTA 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with an order of magnitude higher throughput. [GitHub]